Efficient data selection for ASR

Authors

  • Neil Kleynhans: CSIR, Meraka Institute, HLT group & North-West University; CSIR Site, Building 43, Meiring Naude Road, Brummeria, Pretoria, South Africa. Tel.: +27-12-8414264, Fax: +27-12-8414720
  • Etienne Barnard: MuST group & North-West University; Vaal Triangle Campus, Van Eck Blvd, Vanderbijlpark 1900. Tel.: +27-16-910-3111, Fax: +27-16-910-3116
Abstract

Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource-intensive activity and requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Unfortunately, many languages found in the developing world fall into the resource-scarce category, and this scarcity severely inhibits the deployment of ASR systems in the developing world. One approach to assist with resource-scarce ASR system development is to select “useful” training samples, which could reduce the resources needed to collect new corpora. In this work, we propose a new data selection framework which can be used to design a speech recognition corpus. We show that for limited data sets, independent of language and bandwidth, the most effective strategy for data selection is frequency-matched selection, and that the widely used maximum entropy methods generally produced the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships, and found that a hyperbolic relationship (as suggested from simple asymptotic arguments in learning theory) may lead to somewhat better performance under certain conditions.
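
The logarithmic and hyperbolic accuracy models mentioned in the abstract can be compared on measured data with an ordinary least-squares fit. The Python sketch below is illustrative only: the functional forms accuracy = a + b log N and accuracy = a + b / N are assumptions about what the abstract means, the function names are hypothetical, and the data points are synthetic rather than results from the paper.

```python
import numpy as np


def fit_linear_in_feature(feature, accuracy):
    """Least-squares fit of accuracy = a + b * feature; returns (a, b, rmse)."""
    design = np.column_stack([np.ones_like(feature), feature])
    coef, *_ = np.linalg.lstsq(design, accuracy, rcond=None)
    residual = accuracy - design @ coef
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    return float(coef[0]), float(coef[1]), rmse


def fit_log_model(corpus_size, accuracy):
    # Assumed logarithmic form: accuracy = a + b * log(N)
    return fit_linear_in_feature(np.log(corpus_size), accuracy)


def fit_hyperbolic_model(corpus_size, accuracy):
    # Assumed hyperbolic form: accuracy = a + b / N (b is negative when
    # accuracy rises toward the asymptote a as N grows)
    return fit_linear_in_feature(1.0 / corpus_size, accuracy)


# Synthetic example data (NOT from the paper): accuracies generated from
# an exact logarithmic curve so the two fits can be told apart.
sizes = np.array([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0])
accuracies = 40.0 + 5.0 * np.log(sizes)

log_a, log_b, log_rmse = fit_log_model(sizes, accuracies)
hyp_a, hyp_b, hyp_rmse = fit_hyperbolic_model(sizes, accuracies)

print(f"log model:        a={log_a:.3f}, b={log_b:.3f}, rmse={log_rmse:.3g}")
print(f"hyperbolic model: a={hyp_a:.3f}, b={hyp_b:.3f}, rmse={hyp_rmse:.3g}")
```

Both forms are linear in their parameters, so a single helper covers them by swapping the feature (log N or 1/N). On real recognition results, the model with the lower fitting error on held-out corpus sizes would be the more plausible description.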

Similar resources

In Search of Optimal Data Selection for Training of Automatic Speech Recognition Systems

This paper presents an extended study in the topic of optimal selection of speech data from a database for efficient training of ASR systems. We reconsider a method of optimal selection introduced in our previous work and introduce variosearch as an alternative selection method developed in order to find a representative sample of speech data with a simultaneous control of acoustical and statis...

Efficient Assessment of Asr Systems by Using Subsets of a Test Database

In this paper, assessment of ASR systems with a limited set of speech data selected from a larger testing corpus was studied for connected Dutch digits. Three methods of data selection were applied, namely random, knowledge-based, and data-driven selection. The goal of this study was to find out whether reliable assessment of speech recognition systems can be achieved by using a small sample of ...

Optimizing Data Selection for Automatic Speech Recognition in Low Resource Languages

Developing Automatic Speech Recognition (ASR) systems for low resource languages is a labor-, computation-, and time-intensive task. Data selection techniques seek highly informative subsets of speech data for transcription and can lead to considerable reduction in time and expense for transcription and ASR training. This project investigates unsupervised and supervised data selection techniques...

Efficient codebooks for fast and accurate low resource ASR systems

Today, speech interfaces have become widely employed in mobile devices, thus recognition speed and resource consumption are becoming new metrics of Automatic Speech Recognition (ASR) performance. For ASR systems using continuous Hidden Markov Models (HMMs), the computation of the state likelihood is one of the most time consuming parts. In this paper, we propose novel multi-level Gaussian selec...

APPLICATION OF DEA FOR SELECTING MOST EFFICIENT INFORMATION SYSTEM PROJECT WITH IMPRECISE DATA

The selection of the best Information System (IS) project from many competing proposals is a critical business activity which is very helpful to all organizations. Previous IS project selection methods are useful but have restricted application because they handle only cases with precise data. Indeed, these methods are based on precise data with less emphasis on imprecise data. This paper pro...

Journal:
  • Language Resources and Evaluation

Volume 49

Publication date: 2015